[tx] Left align generated tokens in decoding #933
Merged
pcmoritz merged 7 commits into NovaSky-AI:main on Jan 24, 2026
Conversation
Contributor
Code Review
This pull request refactors the Key-Value cache handling to support per-sequence cache positions, which is a crucial change for enabling efficient left-aligned batch decoding. The changes are consistently applied across the Llama3 and Qwen3 model implementations, as well as the generator utilities. The core logic of using per-sequence positions for updating the KV cache and attention mask seems correct. I've found one minor issue with a duplicated line of code that should be removed. Otherwise, the changes look solid.
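To make "per-sequence cache positions" concrete, here is a minimal sketch of how a decoded token's KV entry could be scattered into a left-aligned cache. The function name, tensor layout, and shapes below are assumptions for illustration, not the PR's actual code:

```python
import torch

def write_decoded_kv(k_cache, v_cache, new_k, new_v, cache_positions):
    """Write one decoded token's KV entry at each sequence's own cache slot.

    k_cache / v_cache: [batch, num_kv_heads, max_len, head_dim]
    new_k / new_v:     [batch, num_kv_heads, 1, head_dim] for the new token
    cache_positions:   [batch] int64, current length of each left-aligned row
    """
    batch = torch.arange(k_cache.shape[0], device=k_cache.device)
    # Advanced indexing on dims 0 and 2 writes each row's new entry at that
    # row's own position, so every sequence stays packed from index 0.
    k_cache[batch, :, cache_positions, :] = new_k.squeeze(2)
    v_cache[batch, :, cache_positions, :] = new_v.squeeze(2)
    return cache_positions + 1  # each sequence advances independently
```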
Comment on lines 161 to 162:

```python
# Pad KV cache and attention mask to max_length
kv_cache = kv_cache.pad_to_length(max_length)
```
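As a rough illustration only (the real `pad_to_length` lives in the repo's KV cache class and may differ), right-padding the cache tensors and attention mask could look something like this:

```python
import torch
import torch.nn.functional as F

def pad_to_length(keys, values, attention_mask, max_length):
    """Right-pad cache tensors and the attention mask along the sequence dim.

    keys / values:  [batch, num_kv_heads, cur_len, head_dim]
    attention_mask: [batch, cur_len], 1 for valid positions, 0 for padding.
    """
    pad = max_length - keys.shape[2]
    if pad <= 0:
        return keys, values, attention_mask
    keys = F.pad(keys, (0, 0, 0, pad))                # pad the seq dimension
    values = F.pad(values, (0, 0, 0, pad))
    attention_mask = F.pad(attention_mask, (0, pad))  # padded slots stay masked
    return keys, values, attention_mask
```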
Contributor
tanmaysachan pushed a commit to tanmaysachan/SkyRL that referenced this pull request on Jan 25, 2026
This PR writes the new decoded token into the KV cache in such a way that the whole sequence is left aligned. This is needed so that the CUDNN attention NovaSky-AI#879 truly works without an attention mask.
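To see why left alignment helps, a toy sketch (names are illustrative, not from the PR): once every row's valid tokens occupy positions [0, len), the attention mask can be derived from the per-sequence lengths alone, and dropped entirely once all rows share the same length:

```python
import torch

def left_aligned_mask(seq_lens, max_len):
    # seq_lens: [batch] current length of each left-aligned sequence.
    positions = torch.arange(max_len, device=seq_lens.device)
    return positions[None, :] < seq_lens[:, None]  # [batch, max_len], True = attend

lens = torch.tensor([3, 5])
print(left_aligned_mask(lens, 6).int())
# tensor([[1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 1, 1, 0]])
```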